NSF PAR Search | NSF Public Access Repository

Hierarchical World Models as Visual Whole-Body Humanoid Controllers

Hansen, Nicklas; V, Jyothir S; Sobal, Vlad; LeCun, Yann; Wang, Xiaolong; Su, Hao (April 2025, The International Conference on Learning Representations (ICLR 2025))

Whole-body control for humanoids is challenging due to the high-dimensional nature of the problem, coupled with the inherent instability of a bipedal morphology. Learning from visual observations further exacerbates this difficulty. In this work, we explore highly data-driven approaches to visual whole-body humanoid control based on reinforcement learning, without any simplifying assumptions, reward design, or skill primitives. Specifically, we propose a hierarchical world model in which a high-level agent generates commands based on visual observations for a low-level agent to execute, both of which are trained with rewards. Our approach produces highly performant control policies in 8 tasks with a simulated 56-DoF humanoid, while synthesizing motions that are broadly preferred by humans.
Free, publicly-accessible full text available April 24, 2026
X-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs

Sobal, Vlad; Ibrahim, Mark; Balestriero, Randall; Cabannes, Vivien; Bouchacourt, Diane; Astolfi, Pietro; Cho, Kyunghyun; LeCun, Yann (April 2025, The International Conference on Learning Representations (ICLR 2025))

Learning good representations involves capturing the diverse ways in which data samples relate. Contrastive loss, an objective matching related samples, underlies methods from self-supervised to multimodal learning. Contrastive losses, however, can be viewed more broadly as modifying a similarity graph to indicate how samples should relate in the embedding space. This view reveals a shortcoming in contrastive learning: the similarity graph is binary, as only one sample is the related positive sample. Crucially, similarities across samples are ignored. Based on this observation, we revise the standard contrastive loss to explicitly encode how a sample relates to others. We experiment with this new objective, called X-Sample Contrastive, to train vision models based on similarities in class or text caption descriptions. Our study spans three scales: ImageNet-1k with 1 million, CC3M with 3 million, and CC12M with 12 million samples. The representations learned via our objective outperform both contrastive self-supervised and vision-language models trained on the same data across a range of tasks. When training on CC12M, we outperform CLIP on both ImageNet and ImageNet Real. Our objective appears to work particularly well in lower-data regimes, with gains over CLIP on both ImageNet and ImageNet Real when training with CC3M. Finally, our objective seems to encourage the model to learn representations that separate objects from their attributes and backgrounds, with gains over CLIP on ImageNet9. We hope the proposed solution takes a small step towards developing richer learning objectives for understanding sample relations in foundation models.
Free, publicly-accessible full text available April 24, 2026
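
The X-Sample abstract above contrasts a binary similarity graph (one positive per sample) with a graph that encodes how each sample relates to every other sample. The following is a minimal sketch of that idea only, not the authors' implementation: the function names, the temperature `tau`, and the row-normalization of the graph are all assumptions made for illustration. With an identity graph, the loss reduces to the familiar one-positive contrastive (InfoNCE-style) objective.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax of a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def normalize_rows(graph):
    """Turn each row of a non-negative similarity graph into a distribution."""
    return [[w / sum(row) for w in row] for row in graph]


def soft_target_contrastive_loss(emb_a, emb_b, targets, tau=0.1):
    """Cross-entropy between target rows and the softmax over pairwise
    similarities. `targets[i]` must sum to 1 (hypothetical interface)."""
    n = len(emb_a)
    loss = 0.0
    for i in range(n):
        logits = [cosine(emb_a[i], emb_b[j]) for j in range(n)]
        predicted = softmax(logits, tau)
        # Skip zero-weight targets so they contribute nothing to the sum.
        loss -= sum(t * math.log(p) for t, p in zip(targets[i], predicted) if t > 0)
    return loss / n


# Toy data: samples 0 and 1 are related, sample 2 is not.
view_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
view_b = [[1.0, 0.1], [0.8, 0.2], [0.1, 1.0]]

binary_graph = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
soft_graph = normalize_rows([[2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 1.0]])

binary_loss = soft_target_contrastive_loss(view_a, view_b, binary_graph)
soft_loss = soft_target_contrastive_loss(view_a, view_b, soft_graph)
```

In this sketch the only difference between the two losses is the target matrix: the binary graph rewards matching each sample to its own pair, while the soft graph also gives credit for placing related samples close together.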
Catalyzing next-generation Artificial Intelligence through NeuroAI

https://doi.org/10.1038/s41467-023-37180-x

Zador, Anthony; Escola, Sean; Richards, Blake; Ölveczky, Bence; Bengio, Yoshua; Boahen, Kwabena; Botvinick, Matthew; Chklovskii, Dmitri; Churchland, Anne; Clopath, Claudia; et al (December 2023, Nature Communications)

Neuroscience has long been an essential driver of progress in artificial intelligence (AI). We propose that to accelerate progress in AI, we must invest in fundamental research in NeuroAI. A core component of this is the embodied Turing test, which challenges AI animal models to interact with the sensorimotor world at skill levels akin to their living counterparts. The embodied Turing test shifts the focus from those capabilities like game playing and language that are especially well-developed or uniquely human to those capabilities – inherited from over 500 million years of evolution – that are shared with all animals. Building models that can pass the embodied Turing test will provide a roadmap for the next generation of AI.
Full Text Available
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Dou, Zi-Yi; Kamath, Aishwarya; Gan, Zhe; Zhang, Pengchuan; Wang, Jianfeng; Li, Linjie; Liu, Zicheng; Liu, Ce; LeCun, Yann; Peng, Nanyun; et al (October 2022, NeurIPS)

Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones to better capture multimodal interactions. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is released at https://github.com/microsoft/FIBER.
Full Text Available
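
The FIBER abstract above describes inserting cross-attention into the image and text backbones rather than stacking fusion layers after them. The sketch below illustrates only that general pattern and is not the released FIBER code: it uses single-head attention with identity projections, assumes both modalities share one feature dimension, and adds a hypothetical scalar gate `alpha` on the residual connection.

```python
import math


def softmax(values):
    """Numerically stable softmax of a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention: each query attends over all context
    vectors. Projections are the identity to keep the sketch short."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * k[i] for w, k in zip(weights, context)) for i in range(dim)]
        )
    return attended


def fused_block(text_tokens, image_tokens, alpha=0.5):
    """One backbone step in which each modality attends to the other.
    `alpha` is an assumed gate; alpha=0 leaves both streams unchanged."""
    text_ctx = cross_attention(text_tokens, image_tokens)
    image_ctx = cross_attention(image_tokens, text_tokens)
    new_text = [
        [t + alpha * c for t, c in zip(tok, ctx)]
        for tok, ctx in zip(text_tokens, text_ctx)
    ]
    new_image = [
        [v + alpha * c for v, c in zip(tok, ctx)]
        for tok, ctx in zip(image_tokens, image_ctx)
    ]
    return new_text, new_image


# Toy inputs: two text tokens and three image patches, feature dimension 2.
text = [[1.0, 0.0], [0.0, 1.0]]
image = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
fused_text, fused_image = fused_block(text, image, alpha=0.5)
```

Because fusion happens inside the block, both token sequences keep their original lengths, so the same step could in principle be repeated at several depths of each backbone.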
Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks

Neyshabur, Behnam; Li, Zhiyuan; Bhojanapalli, Srinadh; LeCun, Yann; Srebro, Nathan (January 2019, International Conference on Learning Representations (ICLR))

Full Text Available
The Mind of a Mouse

https://doi.org/10.1016/j.cell.2020.08.010

Abbott, Larry F.; Bock, Davi D.; Callaway, Edward M.; Denk, Winfried; Dulac, Catherine; Fairhall, Adrienne L.; Fiete, Ila; Harris, Kristen M.; Helmstaedter, Moritz; Jain, Viren; et al (September 2020, Cell)
Full Text Available